by TONG LI
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
In this project, I am going to explore the relationship between the loan amount (LoanOriginalAmount in the dataset) and some other variables selected from the original dataset.
Since the dataset contains many variables, before doing any data analysis I selected 10-15 variables I would like to explore. First, I excluded all variables with proportion of missing values larger than 10%.
## [1] 113937 81
Second, in the variables with missing values fewer than 10%, I selected all the numeric variables and explore their correlation with LoanOriginalAmount. Because there are too many variables and an ordinary correlation matrix will be too big (47 by 47), I will use a flattened correlation table instead. This table contains four columns: (variable) i, (variable) j, cor, & p. Each row contains the correlation coefficient and the p value of one pair of the numeric variables.
By doing the above steps, in addition to LoanOriginalAmount, I selected 6 numeric variables whose correlation magnitude with LoanOriginalAmount were approximately between 0.3 and 0.9.
Next, I selected appropriate factors by plotting their distributions and excluding the ones with extreme ceiling or floor.
Four factors were selected.
Then the 7 numeric variales and 4 factors were selected from the dataset. A new variable “LoanOriginalQuarter” was created from the original variable “LoanOriginationQuarter”, with only Q1, Q2, Q3 and Q4 information in the new variable but not the specific years.
## 'data.frame': 113937 obs. of 11 variables:
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ CreditScoreRangeLower: int 640 680 480 800 680 740 680 700 820 820 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ LoanOriginalQuarter : Factor w/ 4 levels "Q1","Q2","Q3",..: 3 1 1 4 3 4 2 2 4 4 ...
## $ EmploymentStatus : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
## LoanOriginalAmount MonthlyLoanPayment Investors
## Min. : 1000 Min. : 0.0 Min. : 1.00
## 1st Qu.: 4000 1st Qu.: 131.6 1st Qu.: 2.00
## Median : 6500 Median : 217.7 Median : 44.00
## Mean : 8337 Mean : 272.5 Mean : 80.48
## 3rd Qu.:12000 3rd Qu.: 371.6 3rd Qu.: 115.00
## Max. :35000 Max. :2251.5 Max. :1189.00
##
## LP_CustomerPayments LP_ServiceFees Term
## Min. : -2.35 Min. :-664.87 Min. :12.00
## 1st Qu.: 1005.76 1st Qu.: -73.18 1st Qu.:36.00
## Median : 2583.83 Median : -34.44 Median :36.00
## Mean : 4183.08 Mean : -54.73 Mean :40.83
## 3rd Qu.: 5548.40 3rd Qu.: -13.92 3rd Qu.:36.00
## Max. :40702.39 Max. : 32.06 Max. :60.00
##
## CreditScoreRangeLower IncomeRange IsBorrowerHomeowner
## Min. : 0.0 $25,000-49,999:32192 False:56459
## 1st Qu.:660.0 $50,000-74,999:31050 True :57478
## Median :680.0 $100,000+ :17337
## Mean :685.6 $75,000-99,999:16916
## 3rd Qu.:720.0 Not displayed : 7741
## Max. :880.0 $1-24,999 : 7274
## NA's :591 (Other) : 1427
## LoanOriginalQuarter EmploymentStatus
## Q1:29678 Employed :67322
## Q2:24906 Full-time :26355
## Q3:27967 Self-employed: 6134
## Q4:31386 Not available: 5347
## Other : 3806
## (Other) : 2718
## NA's : 2255
Although Term is a numeric variable, it actually only has three levels: 12, 36, and 60. Therefore I plot Term again as a factor
All those numeric variables have skewed distributions (except for Term), and they are also on very different scales. Therefore they need to be standardized and transformed. Before doing the log transformation for each variable, an appropriate integer will be added to each datum based on the range of the standardized values to avoid NAN generated by zero values. The new distributions (after being standardized and log transformed) are as follows:
## LoanOriginalAmount MonthlyLoanPayment Investors
## Min. :-1.1747 Min. :-1.4140 Min. :-0.7698
## 1st Qu.:-0.6944 1st Qu.:-0.7310 1st Qu.:-0.7601
## Median :-0.2941 Median :-0.2841 Median :-0.3533
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5865 3rd Qu.: 0.5143 3rd Qu.: 0.3344
## Max. : 4.2689 Max. :10.2701 Max. :10.7375
##
## LP_CustomerPayments LP_ServiceFees CreditScoreRangeLower
## Min. :-0.8736 Min. :-10.0559 Min. :-10.3158
## 1st Qu.:-0.6632 1st Qu.: -0.3041 1st Qu.: -0.3847
## Median :-0.3338 Median : 0.3343 Median : -0.0838
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2850 3rd Qu.: 0.6725 3rd Qu.: 0.5181
## Max. : 7.6226 Max. : 1.4303 Max. : 2.9256
## NA's :591
The order of the IncomeRange levels needs to be rearranged. In addition, the labels on the x axis were too long and some were overlapped. After adjusted it was plotted again:
The types of Employment Status were confusing: ‘Full-time’, ‘Part-time’ and ‘Self-employed’ should all be considered as ‘Employed’. Therefore, another variable was created in which all the three types were labeled as ‘Employed’ as well.
However, since the loans with a status of ‘Employed’ were much more than the loans with other kinds of status, the Modified Employment Status was probably not a very good variable, if I want to explore its relationship with LoanOriginalAmount.
There are 113,937 records in my dataset, with 11 variables. Seven variables are numeric variables, including LoanOriginalAmount, MonthlyLoanPayment, Investors, LP_CustomerPayments, LP_ServiceFees, Term, and CreditScoreRangeLower. Four variables are factors; their names and levels are as follows:
IncomeRange: Not employed, $0, $1-24,999, $25,000-49,999, $50,000-74,999, $75,000-99,999, $100,000+, Not displayed IsBorrowerHomeowner: Ture, False
LoanOriginalQuarter: Q1, Q2, Q3, Q4
EmploymentStatus2: Employed, Retired, Not employed, Other, Not available, NA (created based on EmploymentStatus)
Other observations:
1. The numeric variables are correlated with LoanOriginalAmount, with the magnitude of correlation approximately between 0.3 and 0.9.
2. The distribution of most numeric variables are not normal, and are on very different scales. Therefore, standardization and log transformation were used on LoanOriginalAmount, MonthlyLoanPayment, Investors, LP_CustomerPayments LP_ServiceFees, and CreditScoreRangeLower.
3. In the modified Employment Status (EmploymentStatus2), the ‘Employed’ status takes a proportion of more than 80%, therefore it might not be a very good predictor.
The main feature of interest is the origination amount of the loan (LoanOriginalAmount). I’d like to see which features are best for predicting the original amount of loan. I have already selected numeric variables which are moderately correlated to LoanOriginalAmount, and factors without extreme distributions, and I want to see whether some combination of these variables can be predict the original amount of loan.
Two new variables were created:
1. LoanOriginalQuarter: created from LoanOriginateQuarter. The new variable only contains in which quarter (Q1, Q2, Q3 or Q4) the loan was made, without the year information.
2. EmploymentStatus2: created from EmploymentStatus. The original variable includes ‘Employed’, ‘Full-time’, ‘Part-time’, ‘Self-employed’ and several other levels. However this is confusing because obviously ‘Employed’ should include the latter three levels. Therefore in the new variable, the level ‘Full-time’, ‘Part-time’ and ‘Self-employed’ were all relabeled as ‘Employed’.
From now on, I will use the new data set created from the previous step which contains transformed numeric variables and adjusted factors.
Generally explore the relationship between the variables in the dataset:
After standardization and log transformation, some correlations were different from the correlation with original variables. Next, I will used plots to show the relationship between the transformed variables:
Although the correlation between LoanOriginalAmount and MonthlyLoanPayment is high (above 0.9), this plot shows that the relationship between the two variables is probably moderated by a third variable. It looks that three regression lines are needed here to capture the linear relationship between LoanOriginalAmount and MonthlyLoanPayment.
This plot shows a weak positive correlation between the two variables.
This plot shows a weak-to-medium correlation between the two variables.
Scatter plot to show the relationship between ServiceFees and LoanOriginalAmount:
This plot shows a medium negative correlation between the two variables.
This plot shows a positive medium correlation between the two variables. It looks that the correlation was affected by extreme values.
This plot shows that the amount of loan for 12, 36 and 60 months of term was different, with more amount corresponding to longer term.
To explore the Loan Original Amount between different lengths of Term:
## Df Sum Sq Mean Sq F value Pr(>F)
## Term 2 605 302.75 8637 <2e-16 ***
## Residuals 113934 3994 0.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: newdata$LoanOriginalAmount and newdata$Term
##
## 12 36
## 36 <2e-16 -
## 60 <2e-16 <2e-16
##
## P value adjustment method: bonferroni
So ANOVA test shows that Loan Amount significantly increases with Term length.
The general trend was that with higher Income, the mean and variance of loan amount was also higher. The exception was at the income of 0. This might be because people receive 0 in their income for different reasons (temparory job, internship, etc.), so for the zero-income people their loan amount may be more likely to depend on other variables.
To explore the difference of loan amount by income range:
## Df Sum Sq Mean Sq F value Pr(>F)
## IncomeRange 7 837 119.54 3620 <2e-16 ***
## Residuals 113929 3762 0.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: newdata$LoanOriginalAmount and newdata$IncomeRange
##
## Not employed $0 $1-24,999 $25,000-49,999
## $0 1.1e-12 - - -
## $1-24,999 0.0084 < 2e-16 - -
## $25,000-49,999 < 2e-16 0.7689 < 2e-16 -
## $50,000-74,999 < 2e-16 < 2e-16 < 2e-16 < 2e-16
## $75,000-99,999 < 2e-16 < 2e-16 < 2e-16 < 2e-16
## $100,000+ < 2e-16 < 2e-16 < 2e-16 < 2e-16
## Not displayed 1.0000 < 2e-16 8.7e-14 < 2e-16
## $50,000-74,999 $75,000-99,999 $100,000+
## $0 - - -
## $1-24,999 - - -
## $25,000-49,999 - - -
## $50,000-74,999 - - -
## $75,000-99,999 < 2e-16 - -
## $100,000+ < 2e-16 < 2e-16 -
## Not displayed < 2e-16 < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
The ANOVA test shows that from $1-24,999 above, loan amount increases with income range.
The loan amount for home owners looks larger than that for non home owners, and larger variance of loan amount was found for home owners.
##
## Welch Two Sample t-test
##
## data: newdata$LoanOriginalAmount by newdata$IsBorrowerHomeowner
## t = -71.593, df = 113170, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08558477 -0.08102359
## sample estimates:
## mean in group False mean in group True
## 0.2107919 0.2940961
The independent t test confirms that overall, home owners have larger loan amount.
The loan amount in different quarter was not very different from one another.
## Df Sum Sq Mean Sq F value Pr(>F)
## LoanOriginalQuarter 3 54 17.84 447.2 <2e-16 ***
## Residuals 113933 4546 0.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: newdata$LoanOriginalAmount and newdata$LoanOriginalQuarter
##
## Q1 Q2 Q3
## Q2 <2e-16 - -
## Q3 <2e-16 0.032 -
## Q4 <2e-16 <2e-16 <2e-16
##
## P value adjustment method: bonferroni
## # A tibble: 4 x 3
## LoanOriginalQuarter mean_amount n
## <fctr> <dbl> <int>
## 1 Q1 0.2804641 29678
## 2 Q2 0.2279122 24906
## 3 Q3 0.2327582 27967
## 4 Q4 0.2643094 31386
The AVOVA test shows difference of loan amount among quarters. To be specific, loan amount in Q1 > Q4 > Q3 > Q2.
From this plot, the loan amount of employed borrowers has larger variance and higher mean compared to that of borrowers with other employment status.
## Df Sum Sq Mean Sq F value Pr(>F)
## EmploymentStatus2 4 119 29.858 762 <2e-16 ***
## Residuals 111677 4376 0.039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2255 observations deleted due to missingness
##
## Pairwise comparisons using t tests with pooled SD
##
## data: newdata$LoanOriginalAmount and newdata$EmploymentStatus2
##
## Employed Retired Not employed Other
## Retired <2e-16 - - -
## Not employed <2e-16 1.00 - -
## Other <2e-16 <2e-16 <2e-16 -
## Not available <2e-16 0.54 1.00 <2e-16
##
## P value adjustment method: bonferroni
So employed borrowers tend to have larger amount of loan compared to borrowers with other employment status. No difference is found between retired and not employed borrowers.
In addition to the main variable (LoanOriginalAmount) that I focused on, I also looked into the relationship between other variables, and found a relationship between pre charge-off cumulative gross payments made by the borrower on the loan (LP_CustomerPayments) and Cumulative service fees paid by the investors who have invested in the loan (LP_ServiceFees).
Observed relationships are as follows:
1. The plots revealed a strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment, which means that people with larger amount of original loan tend to schedule larger amount of monthly loan payment.
2. A weak positive correlation was found between LoanOriginalAmount and the number of investors that funded the loan, which shows that the amount of loan tends to be larger with more investors.
3. A weak positive correlation was found between LoanOriginalAmount and pre charge-off cumulative gross payments made by the borrower (LP_CustomerPayments), which means that the amount of loan tends to be larger if the borrower has made higher pre charge-off payment.
4. A medium negative correlation was found between LoanOriginalAmount and cumulative service fees paid by the investors who have invested in the loan, which means that the amount of loan tends to be larger if the investors have paid more service fees.
5. A weak positive correlation was found between LoanOriginalAmount and the lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. It means that borrowers with higher credit score tend to have larger amount of loan.
6. Larger amount of loan tends to have longer term.
7. Borrowers with their own home tend to have larger amount of loan.
8. In which quarter the loan was created seems to affect the amount of loan.
9. In general, more income is related to larger amount of loan.
10. Borrowers who are employed tend to have larger amount of loan.
An interesting relationship is that there is a relatively strong negative correlation between pre charge-off cumulative gross payments made by the borrower on the loan and cumulative service fees paid by the investors who have invested in the loan. Therefore, if the borrower pay more for the pre charge-off fees, the investors will pay less for the service fees.
The strongest relationship was the correlation between LoanOriginalAmount and MonthlyLoanPayment.
The scatter plot of MonthlyLoanPayment and LoanOriginalAmount showed that three lines were possibly needed to capture the linear relationship between the two variables. Since Term was the only variable with three levels, I first investigated whether different levels of Term would affect the relationship between the two variables.
From the plot, the correlation between LoanOriginalAmount and MonthlyLoanPayment exhibited very good linear relationship on each level of Term. Therefore there should be a interaction between Term and MonthlyLoanPayment when predicting LoanOriginalAmount.
To see whether the relationship between Investors and LoanOriginalAmount is moderated by other variables, I tried several moderators and the plots were as follows:
From the above plots, the relationship between Investors and LoanOriginalAmount might be moderated by Term and IncomeRange. Other variables did not show a strong moderating effect.
To see whether the relationship between LP_CustomerPayments and LoanOriginalAmount is moderated by other variables:
Although in some of the above plots, the regression lines on different levels of a thid variable showed different slopes, more careful examination of the patterns of the dots suggested that most of the different slopes were probably driven by some outliers, or the extreme unbalance between the number of datum in each condition, but not the difference in the patterns of the relationship. Therefore, the only variables was found to moderate the relationship between LP_CustomerPayments and LoanOriginalAmount was LoanOriginalQuarter.
To see whether the relationship between LP_ServiceFees and LoanOriginalAmount is moderated by other variables:
This situation was similar to the previous one. No variable was found to moderate the relationship between LP_ServiceFees and LoanOriginalAmount.
To see whether the relationship between CreditScoreRangeLower and LoanOriginalAmount was moderated by other variables:
The situation looked like the ones with LP_ServiceFees, no other variable was found to moderate the relationship between CreditScoreRangeLower and LoanOriginalAmount.
It seems that there is an interaction between Term and whether the borrower is a home owner.
It seems that the length of term did not affect the relationship between income range and loan amount.
Again, the length of term did not affect the relationship between loan amount and in which quarter the loan was created.
It seems that whether the borrower is a home owner did not affect the relationship between the income range and the amount of loan.
It seems that in which quarter the loan was created did not affect the relationship between borrowers’ income range and the amount of loan.
The observed moderating effects/interactions are as follows:
1. The relationship between loan amount and monthly payment was moderated by the term of the loan. With longer loan term, larger amount of loan tend to relate even more monthly payment.
2. The relationship between loan amount and number of investors was moderated by term length and borrowers’ income. In general, more investors is related to larger amount of loan. With longer term, more investors is still related to larger amount of loan, but not as large as that with shorter term. On the other hand, with higher income, more investors is also related to larger amount of loan but the amount is not that large as for borrowers with less income.
3. An interaction was found between cumulative service fees paid by the investors who had invested in the loan and in which quarter the loan was created.
4. An interaction was found between loan term and whether the borrower was a home owner.
Based on the univariate, bivariate and mutivariate analysis, I’m going to create a multiple regression model to predict original loan amount, including the variables that are correlated to LoanOriginalAmount, and the interactions found from the plots. I will add the variables to the model one after another, and then add interactions, meanwhile comparing the models using ANOVA. One noteworthy thing is that although EmploymentStatus2 seems to affect loan amount, more than 88% of the employment status was employed, therefore the unbalance between different employment status makes it not a good predictor. Therefore EmploymentStatus2 will not be added in the model.
there is an apparant interaction between IsBorrowerHomeowner and Term, I will not include this in the model, for the variable IsBorrowerHomeowner itself seems not to be a good predictor based on other plots, therefore the interaction should not be included. The situation with Quarter is similar: although an interaction between LP_ServiceFees and LoanOriginalQuarter was found, based on the bivariate plot, Quarter does not affect loan amount, therefore quarter and its interaction will not be included in the model.
## Analysis of Variance Table
##
## Model 1: LoanOriginalAmount ~ MonthlyLoanPayment
## Model 2: LoanOriginalAmount ~ MonthlyLoanPayment + Investors
## Model 3: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments
## Model 4: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees
## Model 5: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term
## Model 6: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower
## Model 7: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange
## Model 8: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange +
## IsBorrowerHomeowner
## Model 9: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange +
## IsBorrowerHomeowner + LoanOriginalQuarter
## Model 10: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange +
## IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term
## Model 11: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange +
## IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term +
## Investors:IncomeRange
## Model 12: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange +
## IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term +
## Investors:IncomeRange + Investors:Term
## Model 13: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments +
## LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange +
## IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term +
## Investors:IncomeRange + Investors:Term + Term:IsBorrowerHomeowner
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 111678 466.18
## 2 111677 466.13 1 0.048 32.9398 9.530e-09 ***
## 3 111676 460.89 1 5.239 3573.5462 < 2.2e-16 ***
## 4 111675 423.39 1 37.501 25580.6246 < 2.2e-16 ***
## 5 111673 227.10 2 196.287 66947.4375 < 2.2e-16 ***
## 6 111672 221.40 1 5.701 3888.8565 < 2.2e-16 ***
## 7 111665 213.76 7 7.646 745.1363 < 2.2e-16 ***
## 8 111664 213.67 1 0.087 59.3399 1.337e-14 ***
## 9 111661 212.73 3 0.941 214.0580 < 2.2e-16 ***
## 10 111659 169.40 2 43.323 14776.1865 < 2.2e-16 ***
## 11 111652 169.34 7 0.061 5.9451 6.184e-07 ***
## 12 111650 163.77 2 5.577 1902.2993 < 2.2e-16 ***
## 13 111648 163.67 2 0.092 31.5033 2.100e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Calls:
## m1: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment, data = newdata_comp)
## m2: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors,
## data = newdata_comp)
## m3: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments, data = newdata_comp)
## m4: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees, data = newdata_comp)
##
## ===========================================================================
## m1 m2 m3 m4
## ---------------------------------------------------------------------------
## (Intercept) 0.016*** 0.016*** 0.012*** 0.616***
## (0.000) (0.000) (0.000) (0.006)
## MonthlyLoanPayment 0.938*** 0.938*** 0.946*** 0.910***
## (0.001) (0.001) (0.001) (0.001)
## Investors 0.002*** 0.013*** 0.006***
## (0.001) (0.001) (0.001)
## LP_CustomerPayments -0.021*** -0.056***
## (0.001) (0.001)
## LP_ServiceFees -0.579***
## (0.006)
## ---------------------------------------------------------------------------
## R-squared 0.896 0.896 0.897 0.906
## adj. R-squared 0.896 0.896 0.897 0.906
## sigma 0.065 0.065 0.064 0.062
## F 965265.838 482684.380 325867.434 268518.131
## p 0.000 0.000 0.000 0.000
## Log-likelihood 147470.456 147476.241 148107.369 152846.341
## Deviance 466.179 466.131 460.892 423.391
## AIC -294934.913 -294944.482 -296204.738 -305680.683
## BIC -294906.043 -294905.988 -296156.621 -305622.943
## N 111680 111680 111680 111680
## ===========================================================================
##
## Calls:
## m4: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees, data = newdata_comp)
## m5: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term, data = newdata_comp)
## m6: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower,
## data = newdata_comp)
## m7: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange, data = newdata_comp)
##
## ================================================================================================
## m4 m5 m6 m7
## ------------------------------------------------------------------------------------------------
## (Intercept) 0.616*** 0.047*** -0.103*** -0.129***
## (0.006) (0.005) (0.006) (0.006)
## MonthlyLoanPayment 0.910*** 0.895*** 0.887*** 0.874***
## (0.001) (0.001) (0.001) (0.001)
## Investors 0.006*** 0.019*** 0.017*** 0.017***
## (0.001) (0.000) (0.000) (0.000)
## LP_CustomerPayments -0.056*** -0.021*** -0.021*** -0.023***
## (0.001) (0.001) (0.000) (0.000)
## LP_ServiceFees -0.579*** -0.200*** -0.201*** -0.210***
## (0.006) (0.004) (0.004) (0.004)
## Term: 36/12 0.170*** 0.172*** 0.171***
## (0.001) (0.001) (0.001)
## Term: 60/12 0.262*** 0.262*** 0.262***
## (0.001) (0.001) (0.001)
## CreditScoreRangeLower 0.144*** 0.161***
## (0.003) (0.003)
## IncomeRange: $0/Not employed 0.025***
## (0.002)
## IncomeRange: $1-24,999/Not employed 0.010***
## (0.002)
## IncomeRange: $25,000-49,999/Not employed 0.014***
## (0.002)
## IncomeRange: $50,000-74,999/Not employed 0.023***
## (0.002)
## IncomeRange: $75,000-99,999/Not employed 0.028***
## (0.002)
## IncomeRange: $100,000+/Not employed 0.034***
## (0.002)
## IncomeRange: Not displayed/Not employed 0.040***
## (0.002)
## ------------------------------------------------------------------------------------------------
## R-squared 0.906 0.949 0.951 0.952
## adj. R-squared 0.906 0.949 0.951 0.952
## sigma 0.062 0.045 0.045 0.044
## F 268518.131 349814.089 307969.395 159768.359
## p 0.000 0.000 0.000 0.000
## Log-likelihood 152846.341 187628.478 189048.125 191010.735
## Deviance 423.391 227.104 221.403 213.756
## AIC -305680.683 -375240.957 -378078.250 -381989.471
## BIC -305622.943 -375163.970 -377991.639 -381835.497
## N 111680 111680 111680 111680
## ================================================================================================
##
## Calls:
## m7: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange, data = newdata_comp)
## m8: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange + IsBorrowerHomeowner, data = newdata_comp)
## m9: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter,
## data = newdata_comp)
## m10: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter +
## MonthlyLoanPayment:Term, data = newdata_comp)
##
## ================================================================================================
## m7 m8 m9 m10
## ------------------------------------------------------------------------------------------------
## (Intercept) -0.129*** -0.126*** -0.118*** -0.034***
## (0.006) (0.006) (0.006) (0.005)
## MonthlyLoanPayment 0.874*** 0.874*** 0.871*** 0.382***
## (0.001) (0.001) (0.001) (0.004)
## Investors 0.017*** 0.017*** 0.016*** 0.018***
## (0.000) (0.000) (0.000) (0.000)
## LP_CustomerPayments -0.023*** -0.023*** -0.020*** -0.017***
## (0.000) (0.000) (0.001) (0.000)
## LP_ServiceFees -0.210*** -0.210*** -0.211*** -0.137***
## (0.004) (0.004) (0.004) (0.004)
## Term: 36/12 0.171*** 0.171*** 0.172*** 0.015***
## (0.001) (0.001) (0.001) (0.002)
## Term: 60/12 0.262*** 0.262*** 0.263*** 0.056***
## (0.001) (0.001) (0.001) (0.002)
## CreditScoreRangeLower 0.161*** 0.158*** 0.156*** 0.154***
## (0.003) (0.003) (0.003) (0.003)
## IncomeRange: $0/Not employed 0.025*** 0.025*** 0.025*** 0.024***
## (0.002) (0.002) (0.002) (0.002)
## IncomeRange: $1-24,999/Not employed 0.010*** 0.010*** 0.010*** 0.010***
## (0.002) (0.002) (0.002) (0.001)
## IncomeRange: $25,000-49,999/Not employed 0.014*** 0.014*** 0.013*** 0.014***
## (0.002) (0.002) (0.002) (0.001)
## IncomeRange: $50,000-74,999/Not employed 0.023*** 0.022*** 0.022*** 0.020***
## (0.002) (0.002) (0.002) (0.001)
## IncomeRange: $75,000-99,999/Not employed 0.028*** 0.027*** 0.027*** 0.024***
## (0.002) (0.002) (0.002) (0.001)
## IncomeRange: $100,000+/Not employed 0.034*** 0.033*** 0.033*** 0.030***
## (0.002) (0.002) (0.002) (0.001)
## IncomeRange: Not displayed/Not employed 0.040*** 0.039*** 0.037*** 0.035***
## (0.002) (0.002) (0.002) (0.001)
## IsBorrowerHomeowner: True/False 0.002*** 0.002*** 0.001***
## (0.000) (0.000) (0.000)
## LoanOriginalQuarter: Q2/Q1 -0.008*** -0.007***
## (0.000) (0.000)
## LoanOriginalQuarter: Q3/Q1 -0.007*** -0.007***
## (0.000) (0.000)
## LoanOriginalQuarter: Q4/Q1 -0.003*** -0.004***
## (0.000) (0.000)
## MonthlyLoanPayment x Term: 36/12 0.488***
## (0.004)
## MonthlyLoanPayment x Term: 60/12 0.650***
## (0.004)
## ------------------------------------------------------------------------------------------------
## R-squared 0.952 0.952 0.953 0.962
## adj. R-squared 0.952 0.952 0.953 0.962
## sigma 0.044 0.044 0.044 0.039
## F 159768.359 149179.540 124890.536 142572.058
## p 0.000 0.000 0.000 0.000
## Log-likelihood 191010.735 191033.465 191280.036 203996.114
## Deviance 213.756 213.669 212.728 169.405
## AIC -381989.471 -382032.930 -382520.073 -407948.228
## BIC -381835.497 -381869.332 -382327.605 -407736.513
## N 111680 111680 111680 111680
## ================================================================================================
##
## Calls:
## m10: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter +
## MonthlyLoanPayment:Term, data = newdata_comp)
## m11: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter +
## MonthlyLoanPayment:Term + Investors:IncomeRange, data = newdata_comp)
## m12: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter +
## MonthlyLoanPayment:Term + Investors:IncomeRange + Investors:Term,
## data = newdata_comp)
## m13: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors +
## LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower +
## IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter +
## MonthlyLoanPayment:Term + Investors:IncomeRange + Investors:Term +
## Term:IsBorrowerHomeowner, data = newdata_comp)
##
## ============================================================================================================
## m10 m11 m12 m13
## ------------------------------------------------------------------------------------------------------------
## (Intercept) -0.034*** -0.033*** 0.054*** 0.054***
## (0.005) (0.005) (0.005) (0.006)
## MonthlyLoanPayment 0.382*** 0.382*** 0.281*** 0.281***
## (0.004) (0.004) (0.004) (0.004)
## Investors 0.018*** 0.026*** 0.260*** 0.260***
## (0.000) (0.005) (0.006) (0.006)
## LP_CustomerPayments -0.017*** -0.017*** -0.018*** -0.018***
## (0.000) (0.000) (0.000) (0.000)
## LP_ServiceFees -0.137*** -0.137*** -0.153*** -0.153***
## (0.004) (0.004) (0.004) (0.004)
## Term: 36/12 0.015*** 0.015*** -0.053*** -0.053***
## (0.002) (0.002) (0.002) (0.002)
## Term: 60/12 0.056*** 0.056*** -0.019*** -0.022***
## (0.002) (0.002) (0.002) (0.002)
## CreditScoreRangeLower 0.154*** 0.155*** 0.155*** 0.155***
## (0.003) (0.003) (0.002) (0.002)
## IncomeRange: $0/Not employed 0.024*** 0.022*** 0.021*** 0.021***
## (0.002) (0.002) (0.002) (0.002)
## IncomeRange: $1-24,999/Not employed 0.010*** 0.010*** 0.008*** 0.008***
## (0.001) (0.002) (0.002) (0.002)
## IncomeRange: $25,000-49,999/Not employed 0.014*** 0.013*** 0.012*** 0.011***
## (0.001) (0.002) (0.002) (0.002)
## IncomeRange: $50,000-74,999/Not employed 0.020*** 0.018*** 0.017*** 0.017***
## (0.001) (0.002) (0.002) (0.002)
## IncomeRange: $75,000-99,999/Not employed 0.024*** 0.022*** 0.021*** 0.021***
## (0.001) (0.002) (0.002) (0.002)
## IncomeRange: $100,000+/Not employed 0.030*** 0.028*** 0.027*** 0.027***
## (0.001) (0.002) (0.002) (0.002)
## IncomeRange: Not displayed/Not employed 0.035*** 0.032*** 0.031*** 0.031***
## (0.001) (0.002) (0.002) (0.002)
## IsBorrowerHomeowner: True/False 0.001*** 0.001*** 0.001*** 0.000
## (0.000) (0.000) (0.000) (0.002)
## LoanOriginalQuarter: Q2/Q1 -0.007*** -0.007*** -0.007*** -0.007***
## (0.000) (0.000) (0.000) (0.000)
## LoanOriginalQuarter: Q3/Q1 -0.007*** -0.007*** -0.007*** -0.007***
## (0.000) (0.000) (0.000) (0.000)
## LoanOriginalQuarter: Q4/Q1 -0.004*** -0.004*** -0.004*** -0.004***
## (0.000) (0.000) (0.000) (0.000)
## MonthlyLoanPayment x Term: 36/12 0.488*** 0.488*** 0.587*** 0.587***
## (0.004) (0.004) (0.004) (0.004)
## MonthlyLoanPayment x Term: 60/12 0.650*** 0.650*** 0.757*** 0.755***
## (0.004) (0.004) (0.004) (0.004)
## Investors x IncomeRange: $0/Not employed -0.005 -0.005 -0.006
## (0.007) (0.007) (0.007)
## Investors x IncomeRange: $1-24,999/Not employed -0.003 -0.001 -0.002
## (0.005) (0.005) (0.005)
## Investors x IncomeRange: $25,000-49,999/Not employed -0.007 -0.003 -0.004
## (0.005) (0.005) (0.005)
## Investors x IncomeRange: $50,000-74,999/Not employed -0.010* -0.006 -0.006
## (0.005) (0.005) (0.005)
## Investors x IncomeRange: $75,000-99,999/Not employed -0.008 -0.005 -0.005
## (0.005) (0.005) (0.005)
## Investors x IncomeRange: $100,000+/Not employed -0.008 -0.004 -0.005
## (0.005) (0.005) (0.005)
## Investors x IncomeRange: Not displayed/Not employed -0.013* -0.012* -0.012*
## (0.005) (0.005) (0.005)
## Investors x Term: 36/12 -0.233*** -0.233***
## (0.004) (0.004)
## Investors x Term: 60/12 -0.253*** -0.253***
## (0.004) (0.004)
## Term: 36/12 x IsBorrowerHomeowner: True/False 0.000
## (0.002)
## Term: 60/12 x IsBorrowerHomeowner: True/False 0.005*
## (0.002)
## ------------------------------------------------------------------------------------------------------------
## R-squared 0.962 0.962 0.964 0.964
## adj. R-squared 0.962 0.962 0.964 0.964
## sigma 0.039 0.039 0.038 0.038
## F 142572.058 105641.845 101835.267 95319.336
## p 0.000 0.000 0.000 0.000
## Log-likelihood 203996.114 204016.227 205886.331 205917.834
## Deviance 169.405 169.344 163.766 163.674
## AIC -407948.228 -407974.454 -411710.661 -411769.668
## BIC -407736.513 -407695.376 -411412.336 -411452.096
## N 111680 111680 111680 111680
## ============================================================================================================
The ANOVA test shows that all variables and interactions in the last model (m11) significantly improve the model, therefore all of them should be kept in the model. From the above table of regression models, we can see that MonthlyLoanPayment alone can explain 89.6% of the variance of LoanOriginalAmount. This is consistent with our plots, indicating that MonthlyLoanPayment is the strongest predictor. The second strong predictor is Term, which means that longer loan term is related to larger amount of loan. The final model explains 96.4% of the variance of loan amount.
LoanOriginalAmount is the predicted variable in this dataset. In this Plot One, this variable has been standardized and log transformed, because the original variable was right skewed, and was not on the same scale with all the other numeric variables in the dataset.
This Plot two describes the strongest correlation related to the predicted variable (r > 0.9). Also, according to the plot, the relationship may be better captured by multiple regression lines, indicating that a third variable may moderate the relationship between monthly loan payment and loan amount.
This Plot Three confirms my speculation from Plot Two. The relationship between Monthly Payment and Loan Amount was moderated by Loan Term, which had three levels: 12 months, 36 months and 60 months. According to Plot Three, with longer loan term, the correlation between monthly payment and loan amount tends to be stronger. In addition, the color of points indicates that with higher income, the loan amount tends to be larger.
The series of analysis have shown that the original amount of loan is related to many other variables. This dataset contains six numeric variables in addition to the amount of loan, as well as four factors. After examining the relationship between loan amount and these other variables with plots and multiple linear regression, I found that all these variables predicted the amount of loan.
A strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment was found, which means that people with larger amount of original loan are more likely to schedule larger amount of monthly loan payment. A further analysis showed that this relationship was moderated by the length of loan term. With longer loan term, larger amount of loan tends to relate to even more monthly payment.
The amount of loan also tends to be larger with more investors, and this relationship was found to be moderated by both loan term and borrow’s income. With longer loan term, the correlation between number of investors and loan amount was not as strong as with shorter term; Similarly, with higher borrower’s income, the correlation was not as strong as with less income.
In general, higher income the borrower had, the larger amount of loan they tend to borrow. The employed borrowers and borrowers with higher credit score also tend to have larger amount of loan compared to borrowers with other employment status. Loan with longer term is also more likely has larger amount.
Home owners tend to have larger amount of loan, but it is also affected by the length of loan term. To be specific, whether borrowers have their own home does not matter much for a short term loan (12 months), but it affects longer term loan (36 and 60 months).
The amount of loan also tends to be larger if the borrower has made higher pre charge-off payment, or if the investors have paid more service fees.
The time when the loan was created in a year seems to affect the amount of loan, Q1 > Q4 > Q3 > Q2 although the difference between Q3 and Q2 was significant but very small.
Although the model has shown that using these variables and the interactions between several of them could predict a high proportion of the variance of loan amount, it still has limitations. Monthly payment explains more than 89% of the variance of loan amount and seems to be a very good predictor; However, there might be some variables that could predict both loan amount and monthly payment. For example, one could think that the borrower’s average expense per month, their marital status, how many children they have, and their estates, stock and other properties should also predict both their monthly payment and loan amount. These variables are not include in the original dataset, for the original dataset does not focus on more personal information of the borrowers. I think with more information considered, the model can be much improved.